Let more encodings through and allow registration of encodings #176
Conversation
@mwiebe was there any particular reason that encodings were limited to a fixed set?
It was allowing ascii, utf-8, and some other similar ones. The main reason to limit it to a fairly strict and small set is to keep it easier to support as a standard interface, and to catch errors as early as possible. Think of the history of HTML with IE6, and how liberally it supported very arbitrary input as HTML.
My concern is that not every system spells these encodings in the same way, and many systems support additional encodings that won't conform to the existing set. MySQL is a good example of this. So, instead of special-casing every system, I think it would make sense to allow Python encodings, plus a way for users to register their own encodings.
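To make the proposal concrete, here is a minimal sketch of what a registration mechanism could look like. The names `register_encoding` and `resolve_encoding` are hypothetical illustrations, not the datashape API; the sketch leans on Python's own codec registry to normalize spellings:

```python
import codecs

# Hypothetical registry mapping system-specific names to canonical encodings.
_ENCODING_ALIASES = {}

def register_encoding(alias, canonical):
    """Register a system-specific alias (e.g. a MySQL charset name)."""
    # Validate the canonical name against Python's codec machinery up front,
    # so errors surface at registration time rather than at use time.
    _ENCODING_ALIASES[alias.lower()] = codecs.lookup(canonical).name

def resolve_encoding(name):
    """Resolve a user-supplied name to a canonical codec name."""
    key = name.lower()
    if key in _ENCODING_ALIASES:
        return _ENCODING_ALIASES[key]
    return codecs.lookup(name).name  # fall back to Python's own aliases

register_encoding("latin1_swedish_ci", "latin-1")  # a MySQL collation name
print(resolve_encoding("latin1_swedish_ci"))  # iso8859-1
print(resolve_encoding("UTF8"))               # utf-8
```

This would let systems like MySQL map their spellings onto one canonical set without special-casing each system inside datashape itself.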
This sounds like a similar argument to "System X spells 32-bit integers as 'integer32' instead of 'int32', so we should extend datashape to support that." Extending datashape to allow arbitrary encoding strings would seem to be the opposite of what datashape is for, which is helping systems talk to each other. Isn't the point of datashape to translate different systems' type systems into a common language?
@mwiebe In this PR: libdynd/libdynd#417, it seems like the following is inconsistent with what you're saying above:
Presumably the thing you're using to do the adapting would need to know what encoding to adapt to. Where would that information be stored? What would the datashape of the conversion expression be?
@cpcloud any Unicode encoding (except UCS-2) can represent all strings, so choosing a particular one as the default representation for all normal string interaction is very good for simplifying implementation down to one string type that can get all the usability and performance development focus. Yes, a string adaptor type would need to know the encoding. It also needs to know whether the string is null-terminated, whether it's a fixed memory buffer or variable, etc., because there are a huge number of string storage approaches out there. All of these details would have to be parameters to the adaptor type somehow. I don't know what the datashape of the conversion expression should be, but however it's spelled, I would expect DyND to use an adaptor type under the hood which converts to/from that string storage for any computation.
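As an illustration of the parameters such an adaptor would carry, here is a hypothetical sketch (this is not a DyND API; the `StringAdaptor` name and its fields are invented for the example) covering the encoding, null-termination, and fixed-vs-variable buffer details mentioned above:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch of the parameters a string-adaptor type would need,
# per the discussion above; not an actual DyND type.
@dataclass(frozen=True)
class StringAdaptor:
    encoding: str               # e.g. "latin-1", "utf-16-le"
    null_terminated: bool       # C-style strings vs length-prefixed
    fixed_size: Optional[int]   # byte length of a fixed buffer, or None if variable

    def decode(self, raw: bytes) -> str:
        if self.null_terminated:
            raw = raw.split(b"\x00", 1)[0]  # stop at the first NUL byte
        return raw.decode(self.encoding)

# A 10-byte, NUL-padded, latin-1 field, as a SQL CHAR(10) column might be stored.
adaptor = StringAdaptor(encoding="latin-1", null_terminated=True, fixed_size=10)
print(adaptor.decode(b"caf\xe9\x00\x00\x00\x00\x00\x00"))  # café
```

The point of the sketch is just that each of these storage details becomes a parameter of the adaptor, while computations see a single canonical string type.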
Very keen to see this merged, as I'm getting a ValueError when trying to use blaze with a SQLServer db.
Ping! Just linking some issues which demonstrate that this is an actual problem which people are having. This still affects me, but I've been quiet because I've resorted to monkey-patching my own version of blaze.
Will this be worked on? We really need it...
I can take a look at this. At first glance the aliases seem to be currently supported by DyND-Datashape.
I think latin1 is such a common encoding that it would make sense for it to be one of the canonical encodings. I think you'd still need the ability to register encodings, as there may be many non-standard names out there in the wild.
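Python's codec registry already normalizes many common spellings, which is a quick way to check which aliases come for free and which names (like SQL Server collations) would still need registration. A small illustration, not datashape code:

```python
import codecs

# Python's codec registry maps many aliases onto one canonical name.
for alias in ("latin1", "latin-1", "L1", "iso8859-1"):
    print(alias, "->", codecs.lookup(alias).name)  # all four -> iso8859-1

# A name Python doesn't know, such as a SQL Server collation, raises LookupError.
try:
    codecs.lookup("Latin1_General_CI_AS")
except LookupError as exc:
    print(exc)
```

So the built-in aliases cover the common spellings of latin1, but vendor-specific names would still need a registration hook of some kind.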
@mwiebe @dhirschfeld @cpcloud Perhaps the DyND-Datashape syntax for general type constructors could also work for unknown encodings. Example:

>>> ndt.type("Latin1_General_CI_AS[string]")
ndt.type('Latin1_General_CI_AS[string]')
>>> ndt.type("Latin1_General_CI_AS[bytes]")
ndt.type('Latin1_General_CI_AS[bytes]')

DyND could treat such strings as an opaque blob tagged with the constructor; Blaze could do more. But they'd have the same (existing) syntax.
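To spell out the "opaque blob tagged with the constructor" idea, here is a hypothetical sketch (the `TaggedType` and `parse_tagged` names are invented for illustration; this is not DyND code) of keeping an unrecognized constructor as a tag wrapping a known base type:

```python
from dataclasses import dataclass

# Hypothetical sketch: an unknown constructor like Latin1_General_CI_AS[string]
# is kept as an opaque tag wrapping a known base type. Systems that understand
# the tag (e.g. Blaze talking to SQL Server) can interpret it; others pass it through.
@dataclass(frozen=True)
class TaggedType:
    constructor: str   # e.g. "Latin1_General_CI_AS"
    base: str          # e.g. "string" or "bytes"

    def __str__(self):
        return f"{self.constructor}[{self.base}]"

def parse_tagged(spelling: str) -> TaggedType:
    constructor, _, rest = spelling.partition("[")
    return TaggedType(constructor, rest.rstrip("]"))

t = parse_tagged("Latin1_General_CI_AS[string]")
print(t)  # Latin1_General_CI_AS[string]
```

The round-trip through the existing constructor syntax is the whole point: no new grammar is needed, only a convention for what an unknown constructor means.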
@skrah the syntax will definitely accept that, but it doesn't really fit the semantics defined there. Those types are pattern types with particular type variable names, which should match any concrete type constructable with those arguments, e.g.
@mwiebe Interesting, I've always viewed this as a tag or type constructor. Indeed this matches:

>>> ndt.type("Rational[(int64, int64)]").match(ndt.type("(int64, int64)"))
True

What can you do if you don't want custom types to be matched against concrete types?
Please add this functionality. Is there a workaround for using unsupported encodings with the current version, or do we have to wait?
Hi, can we get this pull request through?
@ivandir, I'm not working on datashape anymore. Sorry this lingered for so long. You're welcome to pull this down, merge it into your local branch, and work off of that. I don't have the bandwidth to take this up right now.
closes #173